Dna Capitals Hackathon

Summary

The aim of this hackathon is to extract specific information about companies. The extraction is developed on a training sample of company names and, at the end of the hackathon, validated against a test set of companies.

Project Links

Source Code

Description

The project is divided into two parts:

Company Info Crawler - Ruby on Rails application that extracts generic information from the target website.
Web Scraper - Python script that uses BeautifulSoup to extract specific information from the target website.

Web Scraper

BeautifulSoup is used to extract specific information about companies from publicly accessible websites (Wikipedia, PitchBook and Bloomberg). The company website is identified by collecting candidate websites from several search engines using different combinations of keywords and then choosing the most relevant one. The information scraped is:

Sector to which the company belongs
About the company
Contact details
Investors of the company
News articles about the company
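
The website-selection step described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names, the scoring rule (reciprocal rank plus a bonus for company-name tokens in the host) and the aggregator list are all assumptions made for the example, and the search results are passed in as plain lists of URLs rather than fetched.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Hosts treated as information sources rather than company sites
# (illustrative list, taken from the sources named in the text).
AGGREGATOR_HOSTS = {"wikipedia.org", "pitchbook.com", "bloomberg.com"}


def normalized_host(url):
    """Return the lower-cased host of a URL without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def is_aggregator(host):
    """True if the host is an aggregator domain or one of its subdomains."""
    return any(host == a or host.endswith("." + a) for a in AGGREGATOR_HOSTS)


def pick_company_website(company_name, result_lists):
    """Pick the most likely company host from several search result lists.

    result_lists holds one list of URLs per keyword combination,
    best result first. Returns a host name, or None if nothing qualifies.
    """
    tokens = re.findall(r"[a-z0-9]+", company_name.lower())
    scores = Counter()
    for urls in result_lists:
        for rank, url in enumerate(urls):
            host = normalized_host(url)
            if not host or is_aggregator(host):
                continue
            # Earlier results earn more; hits across several lists add up.
            scores[host] += 1.0 / (rank + 1)
    for host in list(scores):
        compact = host.replace("-", "").replace(".", "")
        # Bonus for each company-name token that appears in the host.
        scores[host] += sum(1.0 for token in tokens if token in compact)
    if not scores:
        return None
    return max(scores, key=scores.get)


results = [
    ["https://en.wikipedia.org/wiki/Acme_Robotics",
     "https://www.acmerobotics.com/about",
     "https://news.example.com/acme"],
    ["https://acmerobotics.com/contact",
     "https://www.bloomberg.com/profile/company/ACME"],
]
best = pick_company_website("Acme Robotics", results)
```

With the hypothetical results above, `best` is the host that appears in both lists and contains the company-name tokens. A real pipeline would likely need extra checks, for example handling companies whose domain does not resemble their name.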

Company Info Crawler

The website obtained from the Web Scraper is used to extract generic information about the company. The content is fetched from the website using HTTParty, parsed using Nokogiri and indexed using Elasticsearch. This process runs as a background task using Sidekiq. The most informative content for a query is returned using context similarity (latent semantic indexing (LSI), Carrot clustering). Latent Dirichlet allocation (LDA) is used to extract topics from the indexed content that matches the query.
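
To illustrate the idea of returning the most informative content for a query, here is a small ranking sketch. It uses plain TF-IDF cosine similarity as a simpler stand-in; it is not LSI, Carrot clustering or the project's Elasticsearch setup, and the function names and sample passages are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lower-cased alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_passages(query, passages):
    """Return (score, passage) pairs, highest TF-IDF cosine similarity first."""
    docs = [Counter(tokenize(p)) for p in passages]
    n = len(docs)
    # Document frequency: in how many passages each term occurs.
    df = Counter(term for doc in docs for term in doc)

    def weigh(counts):
        # Smoothed inverse document frequency, so unseen terms stay finite.
        return {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0)
                for t, c in counts.items()}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    query_vec = weigh(Counter(tokenize(query)))
    scored = [(cosine(query_vec, weigh(doc)), passage)
              for doc, passage in zip(docs, passages)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


passages = [
    "Acme raised a Series B round led by investors in Boston.",
    "Contact our office by phone or email.",
    "The company builds warehouse robots.",
]
ranked = rank_passages("investors funding round", passages)
```

Here `ranked[0]` holds the passage that shares terms with the query, while passages with no overlap score zero. LSI goes further by projecting the term vectors into a lower-dimensional space, so that passages can match a query without sharing its exact words.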

Meta Data of a Company

Query Results

Appreciations

Won first prize at the hackathon.
